Introduction

One in three adults do not get enough sleep everyday in the United States1. Sleep is extremely important for one’s quality of life, much like eating healthy. However, there is an increasing incidence of people not obtaining enough sleeping on a daily basis. It is known that sleep deficiency can lead to both chronic mental and physical health problems, including increased risk of heart disease and depression2. To better understand how sleep deprivation can affect one’s mental health, and as a student who often suffers from having sleep deficiency, I specifically wanted to look at how sleep deprivation is affecting cognition. Using the transcriptomic data generated by the authors of the paper Altered hippocampal transcriptome dynamics following sleep deprivation3, I investigate how sleep deprivation alters the expression of genes involved in learning and memory. There have been studies that show significant associations between sleep deprivation and neurological disorders4, and accordingly, I hypothesized that sleep deprivation significantly would alter the expression of key genes associated with cognitive function and memory.

Results

FastQC was run to check the quality of the fastq files. The results showed adapter contamination and this issue was resolved using TrimGalore. After running FastQC again after trimming, there were still a few warnings and failed tests. However, the overall quality of the samples was good enough to carry out subsequent analyses. Following quality control analyses, alignment of the samples was done using STAR. Majority of the reads were seen to be uniquely mapped and the scores were good, which allowed for differential gene expression (DGE) analysis.

DGE was done using DESeq2 and was followed up with gene ontology (GO) term enrichment analysis. This allowed for the identification of significant biological processes and cellular components that were affected by sleep deprivation. Less stringent alpha and p-values were used (0.1) as this is an exploratory analysis. The figures refered to in the following paragraphs are in the DGE Analysis and GSEA sections under Methods.

Using a Revigo TreeMap to visualize the significant GO terms, it was seen that processes involved in mRNA metabolic processes, protein dephosphorylation, and cellular response to growth factors were identified. These processes are highly related in brain and memory. In particular, neurons rely heavily on complex RNA splicing allowing for plasticity and learning. The misregulation of this is also associating with cognitive impairment. Growth factors are also very important in long-term memory formation and activate signaling pathways that regulate transcription. For the significant cell component terms, again regulation of transcription is highlighted as the spliceosomal complex, chromatin, and other transcription factor complexes were identified as significant. In particular, the CREB1 transcription factor is known to play a key role in memory formation. Chromatin and chromosomes also play an important role in memory consolidation and prevention of neuronal degeneration.

A GSEA analysis revealed activation of genes that play a part in regulation of lipid storage and forebrain neuronal development. This finding was particularly interesting as it appears that sleep deprivation significantly affects genes involved in forebrain neuron development, which plays an important role in learning and memory. The regulation of lipid storage is likely affected due to an inflammatory response induced by the sleep deprivation. The GSEA analysis also revealed suppression of pathways related to the eye and lens, which don’t seem to relate much with the question we are investigating but may be a downstream effect of disregulation in an different pathway.

Limitations of dataset

The sample size of this dataset is quite small (18 samples) and is collected at only one time point. Increasing the sample size and collecting more samples across different time points could enhance the differences between the condition groups.

Additionally, even after trimming, the FastQC results showed that there was quite a lot of sequence duplication and the per base sequence content was poor. Both these properties can reduce the statistical power making it harder to identify truly differentially expressed genes and their pathways.

One of the main limitations of this investigation is the dataset itself. Due to the characteristics of bulk RNA-seq, discerning cell types is not possible. Single-cell RNA-seq or spatial transcriptomics could potentially provide more statistical power by providing the ability to assess more cell-specific gene expression changes. Additionally, the amount of sleep deprivation could alter the resultant response drastically. To adress this, samples taken over a series of time points could provide more information about how the level of sleep deprivation affects the transcriptome. While interpreting the enriched pathways is a good way to infer the changes due to sleep deprivation, it would be better to have some more behavioral data assessing mice’s cognitive changes like a learning or memory test after a period of sleep deprivation.

The GO terms looked at comprise of all the GO terms available for the species. A more concentrated approach focused on gene sets specific to the brain could improve the analysis. Furthermore, to corroborate the changes in the transcript level, wet lab experiments can confirm the changes in these levels to further confirm these findings.

Future Directions

Methods

Data collection

The data for this project is obtained from the NCBI Gene Expression Omnibus (GEO) under the accession number GSE166831. The authors of the publication “Altered hippocampal transcriptome dynamics following sleep deprivation”5 from the Molecular Brain journal generated this data at the Genomics Division at the Iowa Institute of Human Genetics.

RNA extraction

Whole hippocampi were removed and flash frozen on dry ice from mice (Mus musculus). The samples were homogenized in Qiazol (Invitrogen) and using chloroform, were phase separated. This was followed by centrifugation at 14,000g for 15 minutes. Then, RNA was extracted using the RNeasy kit (Qiagen). Any DNA contamination was removed with RNase-Free DNase (Qiagen). Nanodrop 1 and Agilent Bioanalyzer were used to assess the quality of the samples after resuspending them in RNase-free water. Only samples that had an OD 260/280 and OD 260/230 ratio ~2.0 along with an RNA integrity number (RIN) of >8 were selected for subsequent library preparation.

Library Preparation

The Illumina TruSeq Stranded Total RNA sample prep kit was used to create the sequencing libraries and they were prepared for sequencing using standard Illumina protocols. Ribo-Zero Gold was used to enrich for mRNA through ribo-depletion as this method removes ribosomal RNA. Library concentrations were measured using the KAPA Illumina Library Quantification Kit (KAPA Biosystems). Paired-end sequencing (150 bp paired-end reads) was done in two lanes and 18 samples were sequenced in two batches.

Cell type

The mouse strain used was C57BL/6j and the entire hippocampus was used for this experiment. Accordingly, the total RNA is extracted from cells of hippocampal tissue samples, which mainly comprise of pyramidal neurons, granule cells, and mossy cells.

Experimental condition

The mice were either under the condition where their sleep was deprived by moving and tapping their cage (gentle handling method) or under the condition where their sleep was not disturbed.

Sequencing Platform

The Illumina HiSeq 4000 was used to do the sequencing.

Downloading the data

The FASTQ files are described here: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA702096.

There are 18 samples for this project and their unique accession run IDs are obtained from the link above by clicking on the Accession List button, which downloads the list of IDs locally. I created a file PRJNA702096.txt containing this list of IDs, and the relevant FASTQ files were downloaded using SRA-toolkit from this list. Below is the script used to do this (on github: load_data.sh).

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=125G
#SBATCH --time=4:00:00

cat PRJNA702096.txt | xargs -I {} -P 16 bash -c 'echo -e "Downloading FASTQ file for sample: {} \n"; prefetch {}; fastq-dump --split-files --gzip {}/'

Quality Control

FastQC was run on the samples and from the output.

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=160G
#SBATCH --time=8:00:00

cat PRJNA702096.txt | xargs -n 1 -P 18 bash -c '
    run_fastqc() {
        sample_id=$1
        r1="fastqs/${sample_id}.fastq.gz"
        r2="fastqs/${sample_id}.fastq.gz"
        fastqc $r1 --extract -o fastqc_output/${sample_id}/
        fastqc $r2 --extract -o fastqc_output/${sample_id}/
    }
    run_fastqc "$0"'

mamba activate multiqc
multiqc fastqc_output/*

MultiQC was run to collate the results of the FastQC run on each sample and below are some plots generated from the same.

status checks before trimming
status checks before trimming
adapter content before trimming
adapter content before trimming
seq duplication before trimming
seq duplication before trimming

From the output it is evident that the fastq files contain adapter sequences and the adapters need to be trimmed. The per base sequence content test also failed for most samples, which is likely due to the adapter contamination. The test for sequence duplication levels also failed for many samples, suggesting that there are many copies of a single read from over-amplification. There are warnings about potential excessive overrepresented sequences. This could mean that there are sequence repeated in the genome or potential contamination. The adapter content test also failed, as expected from the other failed tests.

TrimGalore was used to solve the adapter contamination issue. Below is the script used (run_trim_galore.sh).

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=160G
#SBATCH --time=8:00:00

cat PRJNA702096.txt | xargs -n 1 -P 18 bash -c '
    run_trimming() {
        sample_id=$1
        r1="${sample_id}_1.fastq.gz"
        r2="${sample_id}_2.fastq.gz"
        trim_galore --paired --illumina --cores 4 -o trimmed/ $r1 $r2
    }
    run_trimming "$0"'

Script used to run FastQC after trimming is shown below (run_fastqc.sh).

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=160G
#SBATCH --time=8:00:00

cat PRJNA702096.txt | xargs -n 1 -P 18 bash -c '
    run_fastqc() {
        sample_id=$1
        r1="trimmed/${sample_id}_1_val_1.fq.gz"
        r2="trimmed/${sample_id}_2_val_2.fq.gz"
        fastqc $r1 --extract -o fastqc_output_trimmed/${sample_id}/
        fastqc $r2 --extract -o fastqc_output_trimmed/${sample_id}/
    }
    run_fastqc "$0"'
    
mamba activate multiqc
multiqc fastqc_output_trimmed/*

MultiQC was run again to collate the results and some plots from it are shown below.

status checks after trimming GC content after trimming

seq duplication after trimming
seq duplication after trimming
seq len distr after trimming
seq len distr after trimming
per base seq content after trimming
per base seq content after trimming

adapter content after trimming The adapter contamination was removed, which was the most important issue. The overrepresented sequences could be due to highly expressed sequences and the failed tests for sequence duplication levels could be due to overamplification. The variability in sequence length distribution is due to the trimming so it also not too concerning. While there were still some warnings and failed tests, since they seemed to apply to most samples they are not of high concern. The samples are ready to be aligned now.

Alignments

First, the genome and annotation data is obtained to create the index required to align the reads. The genome is downloaded from the UCSC Genome Browser: https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz The annotation is also downloaded from the UCSC Genome Browser: https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz

Using STAR, the index is built for the mm10 mouse strain.

The parameters used are:

  • –runMode: This parameter is set as genomeGenerate to build the genome index.

  • –runThreadN: This parameter is set to 16 to run faster (on 16 CPUs).

  • –genomeDir: This parameter is set to mm10_STARindex, which points to the directory where the index will be stored.

  • –genomeFastaFiles: This parameter points to the genome FASTA file, which is mm10.fa.

  • –sjdbGTFfile: This parameter points to the gene annotation file, which is mm10.gtf.

  • –sjdbOverhang: This parameter is set to 149 because the read length is 150 and the optimal spliced alignment is the read length - 1 = 150 - 1 = 149. This parameter allows more accurate detection of splice junctions with respect to the length of the read and is more suited for RNA-seq.

wget "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz"
gunzip mm10.fa.gz
wget "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz"
gunzip mm10.ncbiRefSeq.gtf.gz

STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir mm10_STARindex \
--genomeFastaFiles mm10_files/mm10.fa \
--sjdbGTFfile mm10_files/mm10.ncbiRefSeq.gtf \
--sjdbOverhang 149

Now that the index has been created, the alignment can be run. STAR is used to run the alignment as it is optimized for RNA-seq datasets whereas BWA is better used for DNA-seq like whole genome sequencing.

The parameters used are:

  • –runMode: This parameter is set as alignReads to align the reads to the genome.

  • –runThreadN: This parameter is set as 18 to run the alignment faster (on 18 CPUs).

  • –genomeDir: This parameter is set to mm10_STARindex, which points to the directory where the index is stored.

  • –readFilesIn: This parameter points to the paired-read files to be aligned SRR13720930_2_val_2.fq.gz and SRR13720930_2_val_2.fq.gz.

  • –readFilesCommand: This parameter is set to zcat to decompress the files as the alignment runs.

  • –outFileNamePrefix: This parameter specifies the prefix of the output files as , which isalignments/SRR13720930.to store the alignment output results in the directoryalignments`.

  • –outSAMtype: This parameter is set to BAM SortedByCoordinate to output the aligned reads as a BAM file and sorted by genomic coordinates.

  • –limitBAMsortRAM: This parameter increases the allocated memory to sort the BAM file to 100GB (100000000000).

Below is the script used to run all the alignments (run_alignments.sh).

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=100G
#SBATCH --time=4:00:00

for line in $(cat PRJNA702096.txt); do
  r1="trimmed/${line}_1_val_1.fq.gz"
  r2="trimmed/${line}_2_val_2.fq.gz"
  STAR --runMode alignReads --runThreadN 16 \
  --genomeDir ../project_old/mm10_STARindex/ \
  --readFilesIn $r1 $r2 \
  --readFilesCommand zcat \
  --outFileNamePrefix alignments/${line}/${line}. \
  --outSAMtype BAM SortedByCoordinate \
  --limitBAMsortRAM 100000000000
done

Alignments QC

To better identify samples by their condition group and batch I renamed them using the script below.

  • NSD refers to the condition group where there was no sleep deprivation induced in the mice.

  • SD refers to the condition group where the mice were sleep deprived.

  • batch1 and batch2 are the labels of the two batches.

#!/bin/bash

while IFS=, read -r run group; do
    old_dir="alignments/${run}/"
    new_dir="alignments/${group}/"
    for file in "$old_dir"*; do
        filename=$(basename "$file")
        extension="${filename#${run}}"
        new_filename="${group}${extension}"
        mv ${file} ${new_dir}${new_filename}
    done
done < sra_run_id_to_group.csv

rm -r alignments/SRR*

After running STAR, MultiQC was run to get an overview of the percentage of reads mapped and the alignment scores. Below are the plots generated.

star reads mapped star alignment scores To understand how the mapped reads are distributed over the different features in the genome RSeQC was run. Below is the script used to run it (run_rseqc.sh).

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=100G
#SBATCH --time=8:00:00

for SAMPLE in alignments/*/*.bam; do
  samtools index $SAMPLE
done

for SAMPLE in alignments/*/*.bam; do
    echo $SAMPLE >> bam_list.txt
done

for SAMPLE in alignments/*/*.bam; do
    SAMPLE_NAME=$(basename ${SAMPLE} .Aligned.sortedByCoord.out.bam)
    read_distribution.py -i ${SAMPLE} \
    -r ../project_old/mm10_files/mm10_RefSeq.bed > \
    rseqc_results/${SAMPLE_NAME}_read_distribution.txt
done

rseqc read distr Majority of the reads are exons, mainly CDS exons, which is good. There is also a large proportion of reads mapped to intronic regions. This could be due to gDNA contamination. However, due to alternative splicing, some introns may be retained, which could also contribute to the these mapped reads.

QoRTs was also run for further QC. QoRTs was chosen over RSeQC as it allows for comparisons to be made between batches and between the condition groups.

Below is the script used to run QoRTs (run_qorts.sh). The decoderFile.txt was manually generated (and is available to view on github) to create the MultiQC plots from the indivisual QoRTs output.

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=200G
#SBATCH --time=8:00:00

for SAMPLE in alignments/*/*.bam; do
    echo $SAMPLE >> bam_list.txt
done

cat bam_list.txt | xargs -n 1 -P 18 bash -c '
    run_qorts() {
        SAMPLE=$1
        SAMPLE_NAME=$(basename "$SAMPLE" .Aligned.sortedByCoord.out.bam)
        java -Xmx4G -jar /athena/angsd/scratch/mef3005/share/envs/qorts/share/qorts-1.3.6-1/QoRTs.jar QC \
        --stranded \
        --generatePdfReport \
        ${SAMPLE} \
        ../project_old/mm10_files/mm10.ncbiRefSeq.gtf \
        qorts_results/$SAMPLE_NAME/
    }
    run_qorts "$0"'

Rscript /athena/angsd/scratch/mef3005/share/envs/qorts/share/qorts-1.3.6-1/qortsGenMultiQC.R qorts_results/ decoderFile.txt multiqc/

Below are the plots generated after running the Rscript qortsGenMultiQC.R mentioned in the script above.

Comparing the two condition groups NSD and SD

qorts by group Comparing the two batches

qorts by batch From the plots generated, there don’t appear to be any batch effects nor any vast differences in alignment quality between the condition groups. The genebody coverage also looks consistent with no biases.

FeatureCounts

To identify the strand specificity of the data, the infer_experiment.py script from RSeQC was run using the code and the output is also described below.

# confirm that data is paired end and the strand specificity
conda activate rseqc
infer_experiment.py -r ../project_old/mm10_files/mm10_RefSeq.bed -i alignments/SD_batch1_1/SD_batch1_1.Aligned.sortedByCoord.out.bam

### Output
Reading reference gene model ../project_old/mm10_files/mm10_RefSeq.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled


This is PairEnd Data
Fraction of reads failed to determine: 0.1541
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0030
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8429
### 

From the values of the fraction of reads it is evident that the data has reversely stranded reads. FeatureCounts was run accordingly with the parameter -s 2. Below is the script used to run FeatureCounts.

#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=100G
#SBATCH --time=4:00:00

featureCounts -a ../project_old/mm10_files/mm10.ncbiRefSeq.gtf -o featureCounts_gene_results -s 2 -p -O -T 18 alignments/*/*.bam

The FeatureCounts summary is plotted below.

# Setup packages
suppressPackageStartupMessages({
  library(ggplot2)
  library(reshape2)
  library(DESeq2)
  library(magrittr)
  library(vsn)
  library(clusterProfiler)
  library(enrichplot)
  library(DOSE)
})
## Warning: package 'DOSE' was built under R version 4.4.3
fc_summary <- read.table("project/featureCounts_gene_results.summary", header=TRUE)

fc_summary_long <- melt(fc_summary, id.vars = "Status", variable.name = "Sample", value.name = "Count")
fc_summary_long$Sample <- sapply(strsplit(as.character(fc_summary_long$Sample), "\\."), `[`, 2)
fc_summary_long_selected_col <- fc_summary_long[fc_summary_long$Status %in% c("Assigned", "Unassigned_Ambiguity", "Unassigned_NoFeatures"),]

ggplot(fc_summary_long_selected_col, aes(x = Count, y = Sample, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "FC Summary - Gene", x = "No. of Reads", y = "Sample") +
  scale_fill_manual(values = c("Assigned" = "salmon", "Unassigned_NoFeatures" = "dodgerblue")) +
  theme(legend.position = "bottom")

Majority of the reads are assigned to genes, which is good, allowing for subsequent analyses.

Processing read counts

# load read counts
df_counts <- read.table("project/featureCounts_gene_results", header=TRUE)
orig_names <- names(df_counts)
names(df_counts) <- c(names(df_counts)[1:6], paste("NSD_batch1", c(1:4), sep = "_"), 
                      paste("NSD_batch2", c(1:5), sep = "_"), 
                      paste("SD_batch1", c(1:4), sep = "_"), 
                      paste("SD_batch2", c(1:5), sep = "_"))

row.names(df_counts) <- make.names(df_counts$Geneid)
cts_gene_sample <- df_counts[ , -c(1:6)]

df_coldata <- data.frame(condition = factor(gsub("_batch[0-9]+_[0-9]+", "", names(cts_gene_sample))),
                         batch = factor(gsub(".*_batch([0-9]+)_.*", "\\1", names(cts_gene_sample))),
                         row.names = colnames(cts_gene_sample))

dds <- DESeqDataSetFromMatrix(countData = cts_gene_sample,
                                         colData = df_coldata,
                                         design = ~ batch + condition)
df_rowdata <- df_counts[, 1:6]
rowData(dds) <- df_rowdata

# plotting counts
colSums(counts(dds)) %>% barplot(las=2, cex.names = 0.6)

# filtering
keep_genes <- rowSums(counts(dds)) > 0
dds <- dds[ keep_genes, ]

Normalization

dds <- estimateSizeFactors(dds) 
plot( sizeFactors(dds), colSums(counts(dds)),
      ylab = "library sizes", xlab = "size factors", cex = .6)

par(mfrow=c(1,2))
counts.sf_normalized <- counts(dds, normalized=TRUE)
boxplot(counts(dds), main = "read counts only", cex = .6, cex.axis = 0.7, las=2)
boxplot(counts.sf_normalized, main = "SF normalized", cex = .6, cex.axis = 0.7, las=2)

# hard to tell so transforming data
par(mfrow=c(1,2))
boxplot(log2(counts(dds) +1), notch=TRUE,
        main = "Non-normalized read counts",
        ylab ="log2(read counts)", cex = .6, cex.axis = 0.7, las=2)
boxplot(log2(counts(dds, normalized=TRUE) +1), notch=TRUE,
        main = "Size-factor-normalized read counts",
        ylab ="log2(read counts)", cex = .6, cex.axis = 0.7, las=2)

### Transformations

Log transformation

og_counts <- log2(counts(dds, normalized = FALSE) + 1)
assay(dds, "log_counts") <- log2(counts(dds, normalized = FALSE) + 1)
assay(dds, "log_norm_counts") <- log2(counts(dds, normalized=TRUE) + 1)

par(mfrow=c(1,2))
dds[, c("NSD_batch2_1","NSD_batch2_2")] %>%
assay(., "log_norm_counts") %>%
plot(., cex=.1, main = "NSD_batch2_1 vs. NSD_batch2_2")
dds[, c("SD_batch2_1","SD_batch2_2")] %>%
assay(., "log_norm_counts") %>%
plot(., cex=.1, main = "SD_batch2_1 vs SD_batch2_2")

MSD Plot

par(mfrow=c(1,1))
msd_plot <- vsn::meanSdPlot(assay(dds, "log_norm_counts"),
                            ranks=FALSE, plot = FALSE)
msd_plot$gg + ggtitle("Sequencing depth normalized log2(read counts)") +
  ylab("standard deviation")

The plot indicates signs of heteroskedasticity.

Reducing the dependence of the variance on the mean

dst_rlog <- rlog(dds, blind = TRUE)

par(mfrow=c(1,2))
plot(assay(dds, "log_norm_counts")[,1:2], cex=.1,
     main = "size factor and log2-transformed")
plot(assay(dst_rlog)[,1:2], cex=.1, main = "rlog transformed",
     xlab = colnames(assay(dst_rlog[,1])), ylab = colnames(assay(dst_rlog[,2])) )

rlog_norm_counts <- assay(dst_rlog)
assay(dds, "rlog_norm_counts") <- rlog_norm_counts
msd_plot <- vsn::meanSdPlot(assay(dst_rlog), ranks=FALSE, plot = FALSE)
msd_plot$gg + labs(title = "Following rlog transformation", x = "Mean", y = "Standard deviation") +
  coord_cartesian(ylim = c(0,3))

save.image(file = "RNAseqProject.RData")

DGE Analysis

Running differential gene expression analysis with DESeq2

dds$condition %<>% relevel(ref="NSD")

dds <- DESeq(dds)
DGE_results <- results(dds,
                       independentFiltering = TRUE,
                       alpha = 0.1)
gene_list <- DGE_results$log2FoldChange
names(gene_list) <- rownames(DGE_results)
gene_list <- sort(gene_list, decreasing = TRUE)
head(gene_list)
##  Gm41768  Mir3962  Olfr279   Gm3809     Gbx2  Gm46481 
## 3.947267 3.586359 3.464385 3.462940 3.361567 3.346090
DGE_genes <- subset(DGE_results, padj < 0.1)
DGE_genes <- DGE_genes[order(DGE_genes$padj), ]

# BiocManager::install("org.Mm.eg.db")
organism <- "org.Mm.eg.db"
res_go <- enrichGO(gene=rownames(DGE_genes),
                   universe=rownames(dds),
                   ont="ALL",
                   keyType="SYMBOL",
                   minGSSize = 3,
                   maxGSSize = 800,
                   pvalueCutoff = 0.1,
                   OrgDb = organism,
                   pAdjustMethod = "BH")

res_go_BP <- res_go[res_go$ONTOLOGY == "BP", ]
res_go_CC <- res_go[res_go$ONTOLOGY == "CC", ]
res_go_MF <- res_go[res_go$ONTOLOGY == "MF", ]


library(treemap) 

# Molecular Processes
write.table(res_go_BP[ , c("ID", "pvalue")],
            file="enrichGO_BP.txt", sep="\t",
            quote=FALSE, row.names=FALSE)

revigo.names <- c("term_ID","description","frequency","value","uniqueness","dispensability","representative");
revigo.data <- rbind(c("GO:0016071","mRNA metabolic process",1.6631731043434954,10.352461167546439,0.5521102682586524,0,"mRNA metabolic process"),
c("GO:0006397","mRNA processing",1.2042316958165515,9.924360639437078,0.4487960710842549,0.54229072,"mRNA metabolic process"),
c("GO:0008380","RNA splicing",0.9583080778010037,9.854967180073567,0.48506534024817216,0.65853797,"mRNA metabolic process"),
c("GO:0016311","dephosphorylation",0.8318929906132368,4.213855032370697,0.8270705940961626,0.05124893,"dephosphorylation"),
c("GO:0006470","protein dephosphorylation",0.49314683038536267,4.4670047162815685,0.7867396018797982,0.35750744,"dephosphorylation"),
c("GO:0043484","regulation of RNA splicing",0.19801584431962324,5.384009570734778,0.9520863016433264,-0,"regulation of RNA splicing"),
c("GO:0048024","regulation of mRNA splicing, via spliceosome",0.14237696102757744,3.9625207302104206,0.9520863016433264,0.26675993,"regulation of RNA splicing"),
c("GO:0071363","cellular response to growth factor stimulus",0.2033569802074241,4.234908329559581,0.8651593165358457,-0,"cellular response to growth factor stimulus"));

stuff <- data.frame(revigo.data);
names(stuff) <- revigo.names;

stuff$value <- as.numeric( as.character(stuff$value) );
stuff$frequency <- as.numeric( as.character(stuff$frequency) );
stuff$uniqueness <- as.numeric( as.character(stuff$uniqueness) );
stuff$dispensability <- as.numeric( as.character(stuff$dispensability) );

treemap(
  stuff,
  index = c("representative","description"),
  vSize = "value",
  type = "categorical",
  vColor = "representative",
  title = "Revigo TreeMap - Biological Processes",
  inflate.labels = FALSE, 
  lowerbound.cex.labels = 0,   
  bg.labels = "#CCCCCCAA",   # define background color of group labels
  position.legend = "none"
)

# Cellular Components
write.table(res_go_CC[ , c("ID", "pvalue")],
            file="enrichGO_CC.txt", sep="\t",
            quote=FALSE, row.names=FALSE)

revigo.names <- c("term_ID","description","frequency","value","uniqueness","dispensability","representative");
revigo.data <- rbind(c("GO:0005681","spliceosomal complex",0.7816247467092002,6.118114798565314,0.49221667694456733,0,"spliceosomal complex"),
c("GO:0000123","histone acetyltransferase complex",0.2530431390055417,3.2182289956014594,0.465869578312614,0.68396245,"spliceosomal complex"),
c("GO:0000785","chromatin",1.0610858619161054,4.589516765224887,0.5179220241149188,0.14407202,"spliceosomal complex"),
c("GO:0016607","nuclear speck",0.1457268168659327,5.387355180092352,0.6917690924756614,0.26216776,"spliceosomal complex"),
c("GO:0034506","chromosome, centromeric core domain",0.005334998877728837,3.2679377378207874,0.6165546458725504,0.50045979,"spliceosomal complex"),
c("GO:0090575","RNA polymerase II transcription regulator complex",0.41859778612518506,3.3285583547613395,0.4542966171182016,0.64141886,"spliceosomal complex"),
c("GO:1990589","ATF4-CREB1 transcription factor complex",0.007253901654610725,4.213763512565655,0.5370404870933506,0.43609566,"spliceosomal complex"));

stuff <- data.frame(revigo.data);
names(stuff) <- revigo.names;

stuff$value <- as.numeric( as.character(stuff$value) );
stuff$frequency <- as.numeric( as.character(stuff$frequency) );
stuff$uniqueness <- as.numeric( as.character(stuff$uniqueness) );
stuff$dispensability <- as.numeric( as.character(stuff$dispensability) );

treemap(
  stuff,
  index = c("representative","description"),
  vSize = "value",
  type = "categorical",
  vColor = "representative",
  title = "Revigo TreeMap - Cellular Component",
  inflate.labels = FALSE, 
  lowerbound.cex.labels = 0,
  bg.labels = "#CCCCCCAA",
  position.legend = "none"
)

Lastly, no significant molecular functions were identified.

GSEA

gse <- gseGO(geneList=gene_list,
             ont ="ALL",
             keyType = "SYMBOL",
             minGSSize = 3,
             maxGSSize = 800,
             pvalueCutoff = 0.2, # 0.05
             verbose = FALSE,
             OrgDb = organism,
             pAdjustMethod = "BH")
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (10.41% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)

Discussion

The first problem encountered was from warnings and failed tests from the FastQC results. While trimming the adapter sequences did remove the adapter contamination issue, there still remained sequence duplication and few overrepresented sequences along with per base sequence content. However, because these properties were not specific to a batch or condition group, they are not extremely harmful to the analyses. However, they do reduce the statistical power.

The number of GO terms enriched were not very high with a p-value cutoff of 0.05. As this is an exploratory study, I set a more lenient cutoff of 0.1. However, even with a less stringent cutoff, the number of enriched terms were not high. Due to this, the results were not as informative.

Similarly, I further increased the p-value cutoff to 0.2 for the GSEA analysis, but still did not see a large number of pathways activated or suppressed.

Due to the samples being collected in two batches, there was a higher risk for batch effects. Fortunately, from the QoRTs results it is deduced that there are not significant differences between the batches.

In addition to the small number of enriched GO terms, a more focused set of terms to look for would have made the analysis much more focused. In the future, running these analyses again with a more specific set of GO terms would be more informative.

Key Datasets

Dataset Description
FASTQ files Downloaded, replaced by trimmed FASTQs
FastQC results Analyzed
Trimmed FASTQ files Used for alignment
Alignments Analyzed
RSeQC results Analyzed
QoRTs results Analyzed
FeatureCounts results Analyzed, extract counts for DGE
Counts dataframe Replaced by processed counts
Counts normalized/transformed Used for DGE
DGE genes Used for GO analysis
GO term enrichment results Used for REVIGO Treemap
GSEA results Analyzed

References


  1. Centers for Disease C, Prevention. Perceived insufficient rest or sleep among adults - United States, 2008. MMWR Morb Mortal Wkly Rep. 2009;58(42):1175–9.↩︎

  2. https://www.nhlbi.nih.gov/health/sleep-deprivation↩︎

  3. Gaine, M. E., Bahl, E., Chatterjee, S., Michaelson, J. J., Abel, T., & Lyons, L. C. (2021). Altered hippocampal transcriptome dynamics following sleep deprivation. Molecular brain, 14(1), 125. https://doi.org/10.1186/s13041-021-00835-1↩︎

  4. Bishir M, Bhat A, Essa MM, Ekpo O, Ihunwo AO, Veeraraghavan VP, et al. Sleep Deprivation and Neurological Disorders. Biomed Res Int. 2020;2020:5764017.↩︎

  5. Gaine, M. E., Bahl, E., Chatterjee, S., Michaelson, J. J., Abel, T., & Lyons, L. C. (2021). Altered hippocampal transcriptome dynamics following sleep deprivation. Molecular brain, 14(1), 125. https://doi.org/10.1186/s13041-021-00835-1↩︎